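Self-supervised models for speech processing have recently emerged as popular foundation blocks in speech processing pipelines. These models are pre-trained on unlabeled audio data and then used in downstream speech processing tasks such as automatic speech recognition (ASR) or speech translation (ST). Since these models are now used in research and industrial systems alike, it becomes necessary to understand the impact of properties such as the gender distribution of the pre-training data. Using French as our language of investigation, we train gender-specific wav2vec 2.0 models and compare them against models whose pre-training data contains different degrees of gender balance. The comparison is performed by applying these models to two speech-to-text downstream tasks: ASR and ST. Results show that the type of downstream integration matters. We observe lower overall performance when gender-specific pre-training is used before fine-tuning an end-to-end ASR system. However, when self-supervised models are used as feature extractors, the overall ASR and ST results follow more complex patterns, in which the balanced pre-trained model does not necessarily lead to the best results. Lastly, our crude "fairness" metric, the relative performance difference measured between female and male test sets, does not show a strong variation from balanced to gender-specific pre-trained wav2vec 2.0 models.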
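The "fairness" metric mentioned above is a relative performance difference between two test sets. A minimal sketch of one way to compute such a gap is given below; normalising by the mean of the two scores is an assumption made here for illustration, and the function name `relative_gap` is hypothetical rather than taken from the paper.

```python
def relative_gap(score_female: float, score_male: float) -> float:
    """Relative difference between two error rates (e.g. WER in %).

    Assumption: the gap is normalised by the mean of the two scores.
    The abstract does not state the exact normalisation.
    """
    mean = 0.5 * (score_female + score_male)
    if mean == 0:
        raise ValueError("both scores are zero, relative gap is undefined")
    return (score_female - score_male) / mean


if __name__ == "__main__":
    # Hypothetical WER values, not results from the paper.
    print(relative_gap(12.0, 10.0))
```

A positive value means the first test set has the higher error rate; a value near zero means the two sets are treated similarly under this particular definition.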
translated by Google Translate
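Executive summary for new participants: (1) The task is to develop a voice anonymization system for speech data that conceals the speaker's voice identity while protecting linguistic content, paralinguistic attributes, intelligibility, and naturalness. (2) Training, development, and evaluation datasets are provided, in addition to 3 different baseline anonymization systems, evaluation scripts, and metrics. Participants apply their anonymization systems, run the evaluation scripts, and submit objective evaluation results and anonymized speech data to the organizers. (3) Results will be presented at a workshop held in conjunction with INTERSPEECH 2022, to which all participants are invited to present their challenge systems and to submit additional workshop papers. Changes w.r.t. 2020, for readers familiar with the VoicePrivacy Challenge: (1) A stronger, semi-informed attack model in the form of an automatic speaker verification (ASV) system trained on anonymized (per-utterance) speech data. (2) Complementary metrics comprising the equal error rate (EER) as the privacy metric, the word error rate (WER) as the primary utility metric, and pitch correlation and voice distinctiveness as secondary utility metrics. (3) A new ranking policy based on a set of minimum target privacy requirements.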
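The primary utility metric named above, the word error rate, is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a plain-Python illustration of that standard definition; it is not the challenge's evaluation script, and the function name `word_error_rate` is chosen here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat", "the dog sat"))
```

Because insertions are counted, the value can exceed 1 when the hypothesis is much longer than the reference.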
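The widespread availability of powerful personal devices capable of collecting their users' voices has opened the opportunity to build speaker-adapted automatic speech recognition (ASR) systems or to take part in collaborative learning of ASR. In both cases, personalized acoustic models (AMs), i.e. AMs fine-tuned with a specific speaker's data, can be built. A question that naturally arises is whether the dissemination of personalized acoustic models can leak personal information. In this paper, we show that it is possible to retrieve not only the gender of the speaker but also the speaker's identity by exploiting only the weight matrix changes of a neural acoustic model locally adapted to that speaker. Incidentally, we observe phenomena that may help explain deep neural networks in the context of speech processing: gender can be identified almost surely using only the first layers, while speaker verification performs well when using the middle-to-upper layers. Our experimental study on the TED-LIUM 3 dataset with HMM/TDNN models shows an accuracy of 95% for gender detection and an equal error rate of 9.07% for a speaker verification task, exploiting only the weights of personalized models, which could be exchanged instead of user data.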
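This paper investigates methods to effectively retrieve speaker information from personalized, speaker-adapted neural network acoustic models (AMs) in automatic speech recognition (ASR). This problem is especially important in the context of federated learning of ASR acoustic models, where a global model is learned on the server from updates received from multiple clients. We propose an approach to analyze the information in neural network AMs based on the neural network's footprint on a so-called Indicator dataset. Using this method, we develop two attack models that aim to infer speaker identity from the updated personalized models without access to the actual users' speech data. Experiments on the TED-LIUM 3 corpus demonstrate that the proposed approaches are very effective and can achieve an equal error rate (EER) of 1-2%.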
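The equal error rate reported above is the operating point of a verification system at which the false acceptance rate equals the false rejection rate. The sketch below illustrates that standard definition with a simple threshold sweep over the observed scores; it is not the evaluation code used in the paper, and the function name and the toy scores are assumptions for illustration.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores) -> float:
    """Approximate EER by sweeping the decision threshold over all scores.

    A trial is accepted when its score is >= threshold.
    """
    tar = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([tar, non]))
    # False acceptance: non-target trials that are accepted.
    far = np.array([(non >= t).mean() for t in thresholds])
    # False rejection: target trials that are rejected.
    frr = np.array([(tar < t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2.0)


if __name__ == "__main__":
    # Toy scores, not data from the paper.
    print(equal_error_rate([1.0, 3.0], [0.0, 2.0]))
```

With few trials the two error curves may not cross exactly, so the sketch averages them at the closest point; toolkits used in practice typically interpolate instead.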
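This paper presents the results and analyses of the first VoicePrivacy 2020 Challenge, which focuses on developing anonymization solutions for speech technology. We provide a systematic overview of the challenge design together with an analysis of the submitted systems and evaluation results. In particular, we describe the voice anonymization task and the datasets used for system development and evaluation. We also present the different attack models and the associated objective and subjective evaluation metrics. We introduce two anonymization baselines and provide a summary description of the anonymization systems developed by the challenge participants. We report objective and subjective evaluation results for the baselines and the submitted systems. In addition, we present experimental results for alternative privacy metrics and attack models developed as part of the post-evaluation analysis. Finally, we summarize our insights and observations, which will influence the design of the next VoicePrivacy Challenge edition, and outline some directions for future voice anonymization research.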
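Anonymisation has the goal of manipulating speech signals in order to degrade the reliability of automatic approaches to speaker recognition, while preserving other aspects of speech, such as those relating to intelligibility and naturalness. This paper reports an approach to anonymisation that, unlike other current approaches, requires no training data, is based on well-known signal processing techniques, and is both efficient and effective. The proposed solution uses the McAdams coefficient to transform the spectral envelope of speech signals. Results obtained using the common VoicePrivacy 2020 databases and protocols show that random, optimised transformations can outperform competing solutions in terms of anonymisation while causing only modest additional degradation to intelligibility, even in the case of a semi-informed privacy adversary.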
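A common way to apply a McAdams coefficient to the spectral envelope is to work on the poles of a linear-prediction (LPC) filter: each complex pole keeps its radius while its angle φ (in radians) is replaced by φ^α. The sketch below illustrates only that pole transformation, under the assumption that LPC coefficients for a frame are already available; framing, LPC estimation, and resynthesis are omitted, and the function name `mcadams_transform` is hypothetical.

```python
import numpy as np


def mcadams_transform(lpc: np.ndarray, alpha: float) -> np.ndarray:
    """Shift the pole angles of an all-pole filter 1/A(z).

    lpc   : coefficients [1, a_1, ..., a_p] of A(z)
    alpha : McAdams coefficient; alpha = 1 leaves the filter unchanged
    """
    poles = np.roots(lpc)
    radius = np.abs(poles)
    angle = np.angle(poles)
    # Real poles (angle 0 or pi) are kept as they are.
    is_complex = np.abs(np.imag(poles)) > 1e-8
    new_angle = np.where(
        is_complex, np.sign(angle) * np.abs(angle) ** alpha, angle
    )
    new_poles = radius * np.exp(1j * new_angle)
    # Conjugate pairs are transformed symmetrically, so the polynomial is
    # real up to rounding error; drop the residual imaginary part.
    return np.real(np.poly(new_poles))


if __name__ == "__main__":
    # Toy filter with one conjugate pole pair and one real pole.
    toy_poles = [0.9 * np.exp(1.2j), 0.9 * np.exp(-1.2j), 0.5]
    a = np.real(np.poly(toy_poles))
    print(mcadams_transform(a, alpha=0.8))
```

Since the pole radii are preserved, a stable filter stays stable; only the positions of the spectral peaks move.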
Most benchmarks for studying surgical interventions focus on a specific challenge instead of leveraging the intrinsic complementarity among different tasks. In this work, we present a new experimental framework towards holistic surgical scene understanding. First, we introduce the Phase, Step, Instrument, and Atomic Visual Action recognition (PSI-AVA) Dataset. PSI-AVA includes annotations for both long-term (Phase and Step recognition) and short-term reasoning (Instrument detection and novel Atomic Action recognition) in robot-assisted radical prostatectomy videos. Second, we present Transformers for Action, Phase, Instrument, and steps Recognition (TAPIR) as a strong baseline for surgical scene understanding. TAPIR leverages our dataset's multi-level annotations as it benefits from the learned representation on the instrument detection task to improve its classification capacity. Our experimental results in both PSI-AVA and other publicly available databases demonstrate the adequacy of our framework to spur future research on holistic surgical scene understanding.
Video provides us with the spatio-temporal consistency needed for visual learning. Recent approaches have utilized this signal to learn correspondence estimation from close-by frame pairs. However, by relying only on close-by frame pairs, those approaches miss out on the richer long-range consistency between distant overlapping frames. To address this, we propose a self-supervised approach for correspondence estimation that learns from multiview consistency in short RGB-D video sequences. Our approach combines pairwise correspondence estimation and registration with a novel SE(3) transformation synchronization algorithm. Our key insight is that self-supervised multiview registration allows us to obtain correspondences over longer time frames, increasing both the diversity and difficulty of sampled pairs. We evaluate our approach on indoor scenes for correspondence estimation and RGB-D point cloud registration and find that we perform on par with supervised approaches.
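Pairwise registration from point correspondences can be illustrated with the classical Kabsch (orthogonal Procrustes) solution, which recovers the rigid SE(3) transform that best aligns two sets of matched 3D points in the least-squares sense. The sketch below is a generic, unweighted textbook version offered as an assumption about what a registration step looks like; it is not the paper's registration or synchronization algorithm.

```python
import numpy as np


def kabsch(P: np.ndarray, Q: np.ndarray):
    """Find R (3x3 rotation) and t (3,) minimising sum ||R @ P[i] + t - Q[i]||^2.

    P, Q : arrays of shape (N, 3) holding matched points, row i of P
           corresponding to row i of Q.
    """
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)            # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.normal(size=(20, 3))
    theta = 0.7
    R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                       [np.sin(theta),  np.cos(theta), 0.0],
                       [0.0, 0.0, 1.0]])
    Q = P @ R_true.T + np.array([0.5, -1.0, 2.0])
    R, t = kabsch(P, Q)
    print(np.allclose(R, R_true), t)
```

In a multiview setting, pairwise estimates like this one are generally not mutually consistent, which is the problem a synchronization step is meant to resolve.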
A quantitative assessment of the global importance of an agent in a team is as valuable as gold for strategists, decision-makers, and sports coaches. Yet retrieving this information is not trivial, since in a cooperative task it is hard to isolate the performance of an individual from that of the whole team. Moreover, the relationship between the role of an agent and its personal attributes is not always clear. In this work we conceive an application of Shapley analysis for studying the contribution of both agent policies and attributes, putting them on an equal footing. Since the computation is NP-hard and scales exponentially with the number of participants in a transferable-utility coalitional game, we resort to exploiting a priori knowledge about the rules of the game to constrain the relations between the participants over a graph. We hence propose a method to determine a Hierarchical Knowledge Graph of agents' policies and features in a Multi-Agent System. Assuming a simulator of the system is available, the graph structure allows dynamic programming to be exploited to assess the importances much faster. We test the proposed approach in a proof-of-case environment deploying both hardcoded policies and policies obtained via Deep Reinforcement Learning. The proposed paradigm is less computationally demanding than trivially computing the Shapley values and provides great insight not only into the importance of an agent in a team but also into the attributes needed to deploy the policy at its best.
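The exponential cost mentioned above is visible in the textbook definition of the Shapley value, which averages a player's marginal contribution over every coalition of the other players. The sketch below implements that brute-force definition for a small game; it corresponds to the "trivial" computation the abstract refers to, not the graph-based method, and the toy value function is an assumption for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a number.

    Evaluates `value` on all 2^n coalitions, so it is only usable for small n.
    """
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


if __name__ == "__main__":
    # Toy additive game: a coalition is worth the sum of its members' weights.
    weights = {"a": 1.0, "b": 2.0, "c": 3.0}
    print(shapley_values(list(weights), lambda s: sum(weights[p] for p in s)))
```

In an additive game each player's Shapley value equals its own weight, which makes it a convenient sanity check before trying games with interactions between players.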
Transfer learning on the edge is challenging due to limited on-device resources. Existing work addresses this issue by training a subset of parameters or adding model patches. Developed with inference in mind, Inverted Residual Blocks (IRBs) split a convolutional layer into depthwise and pointwise convolutions, leading to more stacked layers, e.g., convolution, normalization, and activation layers. Though they are efficient for inference, IRBs require additional activation maps to be stored in memory when training the weights of convolution layers and the scales of normalization layers. As a result, their high memory cost prohibits training IRBs on resource-limited edge devices, making them unsuitable in the context of transfer learning. To address this issue, we present MobileTL, a memory- and computation-efficient on-device transfer learning method for models built with IRBs. MobileTL trains the shifts of internal normalization layers to avoid storing activation maps for the backward pass. Also, MobileTL approximates the backward computation of the activation layer (e.g., Hard-Swish and ReLU6) as a signed function, which enables storing a binary mask instead of activation maps for the backward pass. MobileTL fine-tunes a few top blocks (close to the output) rather than propagating the gradient through the whole network, to reduce the computation cost. Our method reduces memory usage by 46% and 53% for MobileNetV2 and V3 IRBs, respectively. For MobileNetV3, we observe a 36% reduction in floating-point operations (FLOPs) when fine-tuning 5 blocks, while incurring only a 0.6% accuracy reduction on CIFAR10. Extensive experiments on multiple datasets demonstrate that our method is Pareto-optimal (best accuracy under given hardware constraints) compared to prior work in transfer learning for edge devices.
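The idea of keeping a binary mask instead of a full activation map can be illustrated with ReLU6, whose derivative is 1 for inputs strictly between 0 and 6 and 0 elsewhere, so one bit per element is enough for the backward pass. The NumPy sketch below is only an illustration of that memory trade-off, not MobileTL's implementation; the function names are hypothetical, and the Hard-Swish approximation described in the abstract is not reproduced here.

```python
import numpy as np


def relu6_forward(x: np.ndarray):
    """Return the ReLU6 output plus a 1-bit-per-element mask for backward."""
    out = np.clip(x, 0.0, 6.0)
    mask = (x > 0.0) & (x < 6.0)          # where the derivative equals 1
    packed = np.packbits(mask)            # flattens and stores 8 flags per byte
    return out, (packed, x.shape)


def relu6_backward(grad_out: np.ndarray, saved) -> np.ndarray:
    """Propagate the gradient using only the packed mask."""
    packed, shape = saved
    n = int(np.prod(shape))
    mask = np.unpackbits(packed, count=n).reshape(shape)
    return grad_out * mask.astype(grad_out.dtype)


if __name__ == "__main__":
    x = np.array([[-1.0, 0.5], [3.0, 7.0]], dtype=np.float32)
    out, saved = relu6_forward(x)
    grad_in = relu6_backward(np.ones_like(x), saved)
    print(out, grad_in, saved[0].nbytes, x.nbytes)
```

For a float32 tensor the packed mask is roughly 32 times smaller than the activation map it replaces, which is the kind of saving the abstract attributes to storing a binary mask.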